We have already discussed the fact that slices and features have coordinates, but we have not defined exactly what these coordinates mean.
Ensembl, and many other bioinformatics applications, use inclusive coordinates which start at 1. The first nucleotide of a DNA sequence is 1 and the first amino acid of a protein sequence is also 1. The length of a sequence is defined as end - start + 1.
In some rare cases inserts are specified with a start which is one greater than the end. For example a feature with a start of 10 and an end of 9 would be a zero length feature between base pairs 9 and 10.
Slice coordinates are relative to the start of the underlying DNA sequence region. The strand of the slice represents its orientation relative to the default orientation of the sequence region. By convention the start of the slice is always less than the end, and does not vary with its strandedness. Most slices you will encounter will have a strand of 1, and this is what we will consider in our examples. It is legal to create a slice which extends past the boundaries of a sequence region. Sequence retrieved from regions where the sequence is not defined will consist of Ns.
All features retrieved from the database have an associated slice (accessible via the slice() method). A feature's coordinates are always relative to this associated slice, i.e. the start and end attributes define a feature's position relative to the start of the slice the feature is on (or the end of the slice if it is a negative strand slice). The strand attribute of a feature is relative to the strand of the slice. By convention the start of a feature is always less than or equal to the end of the feature regardless of its strand (except in the case of an insert). It is legal to have features with coordinates which are less than one or greater than the length of the slice. Such cases are common when features that partially overlap a slice are retrieved from the database.
Consider, for example, the following figure of two features associated with a slice:
[-----] (Feature A)
|================================| (Slice)
[--------] (Feature B)
A C T A A A T C T T G (Sequence)
1 2 3 4 5 6 7 8 9 10 11 12 13
The slice itself has a start of 2, an end of 13, and a length of 12 even though the underlying sequence region only has a length of 11. Retrieving the sequence of such a slice would give the string CTAAATCTTGNN — the undefined region of sequence is represented by Ns. Feature A has a start of 0, an end of 2, and a strand of 1. Feature B has a start of 3, an end of 6, and a strand of -1.
Sequences stored in Ensembl are associated with coordinate systems. What the coordinate systems are varies from species to species. For example, the homo_sapiens database has the following coordinate systems: contig, clone, supercontig, chromosome. Sequence and features may be retrieved from any coordinate system despite the fact they are only stored internally in a single coordinate system. The database stores the relationship between these coordinate systems and the API provides means to convert between them. The API has a CoordSystem object and object adaptor, however, these are most often used internally. The following example fetches a chromosome coordinate system object from the database:
my $cs_adaptor = $registry->get_adaptor( 'Human', 'Core', 'CoordSystem' );
my $cs = $cs_adaptor->fetch_by_name('chromosome');
printf "Coordinate system: %s %s\n", $cs->name(), $cs->version();
A coordinate system is uniquely defined by its name and version. Most coordinate systems do not have a version, and the ones that do have a default version, so it is usually sufficient to use only the name when requesting a coordinate system. For example, chromosome coordinate systems have a version which is the assembly that defined the construction of the coordinate system. The version of the human chromosome coordinate system might be something like NCBI35 or NCBI36, depending on the version of the Core databases used.
Slice objects have an associated CoordSystem object and a seq_region_name() method that returns its name which uniquely defines the sequence that they are positioned on. You may have noticed that the coordinate system of the sequence region was specified when obtaining a slice in thefetch_by_region() method. Similarly the version may also be specified (though it can almost always be omitted):
$slice =
$slice_adaptor->fetch_by_region( 'chromosome', 'X', 1e6, 10e6, '1', 'NCBI36' );
Sometimes it is useful to obtain full slices of every sequence in a given coordinate system; this may be done using the SliceAdaptor method fetch_all():
my @chromosomes = @{ $slice_adaptor->fetch_all('chromosome') };
my @clones = @{ $slice_adaptor->fetch_all('clone') };
Now suppose that you wish to write code which is independent of the species used. Not all species have the same coordinate systems; the available coordinate systems depends on the style of assembly used for that species (WGS, clone-based, etc.). You can obtain the list of available coordinate systems for a species using the CoordSystemAdaptor and there is also a special pseudo-coordinate system named toplevel. The toplevel coordinate system is not a real coordinate system, but is used to refer to the highest level coordinate system in a given region. The toplevel coordinate system is particularly useful in genomes that are incompletely assembled. For example, the latest zebrafish genome consists of a set of assembled chromosomes, and a set of supercontigs that are not part of any chromosome. In this example, the toplevel coordinate system sometimes refers to the chromosome coordinate system and sometimes to the supercontig coordinate system depending on the region it is used in.
# List all coordinate systems in this database:
my @coord_systems = @{ $cs_adaptor->fetch_all() };
foreach $cs (@coord_systems) {
printf "Coordinate system: %s %s\n", $cs->name(), $cs->version;
}
# Get all slices on the highest coordinate system:
my @slices = @{ $slice_adaptor->fetch_all('toplevel') };
Features on a slice in a given coordinate system may be moved to another slice in the same coordinate system or to another coordinate system entirely. This is useful if you are working with a particular coordinate system but you are interested in obtaining the features coordinates in another coordinate system.
The transform() method can be used to move a feature to any coordinate system which is in the database. The feature will be placed on a slice which spans the entire sequence that the feature is on in the requested coordinate system.
if ( my $new_feature = $feature->transform('clone') ) {
printf(
"Feature's clonal position is: %s %d-%d (%+d)\n",
$new_feature->slice->seq_region_name(),
$new_feature->start(), $feature->end(), $new_feature->strand()
);
} else {
print "Feature is not defined in clonal coordinate system\n";
}
The transform() method returns a copy of the original feature in the new coordinate system, or undef if the feature is not defined in that coordinate system. A feature is considered to be undefined in a coordinate system if it overlaps an undefined region or if it crosses a coordinate system boundary. Take for example the tiling path relationship between chromosome and contig coordinate systems:
|~~~~~~~| (Feature A) |~~~~| (Feature B)
(ctg 1) [=============]
(ctg 2) (------==========] (ctg 2)
(ctg 3) (--============] (ctg3)
Both Feature A and Feature B are defined in the chromosomal coordinate system described by the tiling path of contigs. However, Feature A is not defined in the contig coordinate system because it spans both Contig 1 and Contig 2. Feature B, on the other hand, is still defined in the contig coordinate system.
The special toplevel coordinate system can also be used in this instance to move the feature to the highest possible coordinate system in a given region:
my $new_feature = $feature->transform('toplevel');
printf(
"Feature's toplevel position is: %s %s %d-%d (%+d)\n",
$new_feature->slice->coord_system->name(),
$new_feature->slice->seq_region_name(),
$new_feature->start(),
$new_feature->end(),
$new_feature->strand()
);
Another method that is available on all Ensembl features is the transfer() method. The transfer() method is similar to the previously described transform() method, but rather than taking a coordinate system argument it takes a slice argument. This is useful when you want a feature's coordinates to be relative to a certain region. Calling transform() on the feature will return a copy of the which is shifted onto the provided slice. If the feature would be placed on a gap or across a coordinate system boundary, then undef is returned instead. It is illegal to transfer a feature to a slice on a sequence region which is cannot be placed on. For example, a feature which is on chromosome X cannot be transferred to a slice on chromosome 20 and attempting to do so will raise an exception. It is legal to transfer a feature to a slice on which it has coordinates past the slice end or before the slice start. The following example illustrates the use of the transfer method:
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', '2',
1e6, 2e6 );
my $new_slice =
$slice_adaptor->fetch_by_region( 'chromosome', '2', 1.5e6, 2e6 );
foreach my $simple_feature ( @{ $slice->get_all_SimpleFeatures('Eponine') } ) {
printf(
"Before:\t%6d - %6d\n",
$simple_feature->start(),
$simple_feature->end()
);
my $new_feature = $simple_feature->transfer($new_slice); if (
!defined $new_feature ) { print "Could not transfer feature\n"; }
else { printf( "After:\t%6d - %6d\n", $new_feature->start(),
$new_feature->end() ); } }
In the above example a slice from another coordinate system could also have been used, provided you had an idea about what sequence region the features would be mapped to.